Head and Neck Squamous Cell Carcinoma (HNSCC) is a heterogeneous group of tumors arising from the mucosal epithelium of the oral cavity, pharynx, and larynx. HNSCC is typically stratified by the presence of Human Papilloma Virus (HPV). Patients with HPV + HNSCC have a better prognosis than HPV - patients. This analysis aims to identify differentially expressed genes (DEGs) between HPV-positive and HPV-negative HNSCC samples using RNA-seq data from NCBI GEO (GSE72536). We performed quality assessment, preprocessing, batch effect correction, differential expression analysis, and functional enrichment analysis to elucidate the molecular differences associated with HPV status in HNSCC. We identified Genes like SYCP2, STAG3, TEX15, ZNF541, and ZPBP2 are significantly upregulated in HPV + tumors. GO and KEGG enrichment analyses revealed significant enrichment in immune-related processes and pathways.These findings suggest HPV-associated HNSCC mounts a more robust immune response compared to HPV - HNSCC. The molecular interactions behind this can be explored in the future and may inform future therapeutic strategies.
Head and neck squamous cell carcinoma (HNSCC) is a heterogenous group of tumors at many anatomical sites in the oral cavity, pharynx and larynx originating from mucosal epithelium (Ferlay et al. 2019). HNSCC is the 6th most common cancer worldwide, with a median diagnosis of 66 years, and represents ~90% of head and neck cancers (“Head and Neck Squamous Cell Carcinoma | Nature Reviews Disease Primers,” n.d.). HNSCC accounts for more than 600,000 new cases and 350,000 deaths annually. Tobacco smoking, alcohol use, and areca nuts are major risk factors in developing HNSCC. Cancer arising in the oral cavity or larynx are typically associated with tobacco and alcohol use, whereas Oropharyngeal tumors- affecting the tonsil, back 1/3 of the tongue, and throat- are almost always HPV positive. HPV positive HNSCC is associated with high-risk HPV subtypes, namely HPV16, and is associated with better outcomes when compared to HPV negative HNSCC (Michaud et al. 2014).
The different anatomical site and etiology of HPV positive and negative HNSCC suggest that these two diseases have different molecular characteristics. Understanding the molecular differences between HPV positive and negative HNSCC is crucial for developing targeted therapies and improving patient outcomes.This analysis aims to identify differentially expressed genes (DEGs) between HPV positive and HPV negative HNSCC samples using RNA-seq data from NCBI GEO.The data set from Wood, O, et al. (GSE72536) and the manuscript “Gene expression analysis of TIL rich HPV driven head and neck tumors reveals a distinct B-cell signature when compared to HPV independent tumours” was used (Wood et al. 2016). Here, we performed quality assessment, preprocessing, batch effect correction, differential expression analysis, and functional enrichment analysis to elucidate the molecular differences associated with HPV status in HNSCC.
Quality assessment
Raw counts were inspected for library size, count distribution, sample correlation, and principal components. Library sizes were computed as column sums of the raw count matrix and visualized with a barplot. Per‑sample count distributions were visualized as log2‑CPM density plots and boxplots. Sample–sample Pearson correlations were computed on log2‑CPM and displayed as a heatmap. Principal component analysis (PCA) was performed on the top variable genes (top 20% by variance) and plotted to inspect sample clustering and potential confounders.
Preprocessing
Genes with low expression were filtered using edgeR::filterByExpr() (Robinson, McCarthy, and Smyth 2010) to retain genes with sufficient counts for statistical testing. TMM normalization was applied with edgeR::calcNormFactors(). For visualization, log2‑CPM with a prior count of 1 was used. Gene identifiers were mapped to Entrez IDs using AnnotationDbi::mapIds(org.Hs.eg.db)(Carlson 2019)
Batch effect correction
Batch variables were unknown so surrogate variables were estimated with sva::svaseq()(Leek and Storey 2007) and included surrogate variables in the differential expression model; surrogate variables were also regressed out for visualization. PCA plots and hierarchical clustering (top 10% most variable genes) are shown before and after correction.
Differential expression analysis
Differential expression was performed with edgeR (Robinson, McCarthy, and Smyth 2010) using a generalized linear model. The design matrix included the primary factor (HPV status) and surrogate variables estimated by SVA (Leek and Storey 2007). Dispersion was estimated with estimateDisp(), models were fit with glmFit(), and tests were performed with glmLRT(). P‑values were adjusted for multiple testing using the Benjamini–Hochberg procedure. Genes with FDR < 0.05 and |log2FC| ≥ 1 were considered significant. Top differentially expressed genes and boxplots for the top 10 genes are shown.
Functional enrichment analysis
Gene Ontology enrichment (BP, MF, CC) and KEGG pathway enrichment were performed using clusterProfiler (Yu et al. 2012). Over‑representation analysis (ORA) used enrichGO/enrichKEGG on the set of significant genes (FDR < 0.05, |log2FC| ≥ 1) with the tested gene universe defined as all mapped Entrez IDs. Gene set enrichment analysis (GSEA) used a ranked gene list by logFC and FDR. Multiple testing correction used Benjamini–Hochberg; terms with adjusted p < 0.05 were reported. KEGG queries used the KEGG REST API. Visualizations include dotplots and pathview (Luo and Brouwer 2013) for KEGG pathway diagram.
Figure 1: Figure 1a
Overview of BIOS658 project.
Figure 2: Figure 1b
Library size distribution across samples.
Figure 3: Figure 1c
Log2-CPM density plot across samples.
Figure 4: Figure 1d
Log2-CPM boxplot across samples.
Figure 5: Figure 1e
Sample–sample correlation heatmap.
Principal component analysis (PCA) and hierarchical clustering were performed to assess batch effects and sample clustering based on HPV status. First, PCA plots before and after batch effect correction are shown in Figure 2a and 2b, respectively. Before correction, samples clustered primarily by HPV status, indicating that biological variation was the dominant source of variation. After SVA correction, the clustering pattern remained similar, suggesting that batch effects were minimal and did not confound the biological signal.
Figure 6: Figure 2a
PCA plot before batch correction.
Figure 7: Figure 2b
PCA plot after batch correction.
Figure 8: Figure 2c
Hierarchical clustering before correction.
Figure 9: Figure 2d
Hierarchical clustering after correction.
Figure 10: Figure 3a
MA plot of differential expression.
Figure 11: Figure 3b
Heatmap of top 50 DE genes.
Figure 12: Figure 3c
Volcano plot of DE results.
Figure 13: Figure 3d
Boxplots of top 10 DE genes.
Figure 14: Figure 4a
GO Biological Process enrichment .
Figure 15: Figure 4b
GO Molecular Function enrichment (ORA).
Figure 16: Figure 4c
GO Cellular Component enrichment (ORA).
Figure 17: Figure 4d
KEGG pathway enrichment (ORA).
Figure 18: Figure 5
KEGG pathway map (Antigen processing and presentation, hsa04612).
In this study, we analyzed RNA-seq data from HPV positive and negative head and neck squamous cell carcinoma (HNSCC) samples to identify differentially expressed genes (DEGs) associated with HPV status. Quality assessment confirmed data integrity. Bar graphs and box plots of library size and log2-CPM were performed and revealed relatively uniform counts. There was a range of ~ 10 million counts between the high and low samples, but no outliers were detected. This is expected as the NCBI GEO data set GSE72536 was already filtered by the authors prior to uploading. The authors exluded 16 samples in the original study (39 HNSCC samples down to 23 samples), selecting for TIL-rich tumors.This could explain any differences in results, due to a second layer of quality control. Nonetheless, samples were originally selected to be immune rich, and as expected, we see an increase in immune-related GO terms after GO analysis.Next, we assessed batch effects using PCA and hierarchical clustering. In this analysis, PCA plots before and after batch effect correction were performed as part of a standard pipeline. Before SVA correction, the PCA plot (Figure 2a) showed strong clustering around HPV status that remained after SVA correction, indicating that HPV status was a major source of variation in the data. Hierarchical clustering (Figure 2c and 2d) further supported this, with samples clustering primarily by HPV status rather than batch. This suggests that while batch effects were present, they were minor and did not defer from the biological question.
Differential expression analysis using edgeR identified a substantial number of DEGs between HPV positive and negative samples. The MA plot (Figure 3a) illustrated a clear separation of upregulated and downregulated genes across a range of expression levels. This suggest that differences between HPV + and HPV - samples are not due to a handful of highly expressed genes. The heatmap of the top 50 DEGs (Figure 3b) demonstrated distinct expression patterns between the two groups,and The volcano plot (Figure 3c) highlighted several genes with both high fold changes and statistical significance, including SYCP2, STAG3, TEX15, ZNF541, and ZPBP2 as significantly upregulated in HPV positive tumors. Boxplots of the top 10 DEGs (Figure 3d) confirmed consistent expression differences between HPV positive and negative samples.Wood et al, also showed a strong seperation of HPV+ and HPV- samples based on DEGs. They noted an immune signature enrichment in HPV+ samples, which is consistent with our findings. Despite this, there are some differences in the specific DEGs identified, likely due to differences in analysis pipelines and filtering criteria. Wood, et al. key finding was a distinct B cell signature in HPV+ tumors, which included CD200, ADAM28, BCL2, VCAM1, and STAG3. In their pipeline, they did not use SVA for batch correction, instead they added CD19,CD4 and CD8 as covariates. This could explain some of the differences in DEGs identified. Nevertheless, some of the top GO terms from our analysis align with Wood et al’s findings, particularly those related to immune processes.
Woods, et al. revealed distinct B cell signature in HPV+ tumors, which included CD200, ADAM28, BCL2, VCAM1, and STAG3, when TILs were corrected for. This analysis did not recreate the original paper, but provided further synergistic evidence that HPV+ tumors have a distinct immune microenvironment compared to HPV- tumors, when an additional layer of QC and filtering was applied. Together, these results highlight the connection between HPV status, Immune interaction, and better prognosis in HNSCC. Further research is required to elucidate the mechanistic links and therapeutic implications of these findings.
```